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Abstract 



Accumulation of standardized data collections is opening up novel opportunities 
for holistic characterization of genome function. The limited scalability of current 
preprocessing techniques has, however, formed a bottleneck for full utilization of con- 
temporary microarray collections. While short oligonucleotide arrays constitute a ma- 
jor source of genome- wide profiling data, scalable probe-level preprocessing algorithms 
have been available only for few measurement platforms based on pre-calculated model 
parameters from restricted reference training sets. To overcome these key limitations, 
we introduce a fully scalable online-learning algorithm that provides tools to process 
large microarray atlases including tens of thousands of arrays. Unlike the alterna- 
tives, the proposed algorithm scales up in linear time with respect to sample size and 
is readily applicable to all short oligonucleotide platforms. This is the only available 
preprocessing algorithm that can learn probe-level parameters based on sequential hy- 
perparameter updates at small, consecutive batches of data, thus circumventing the 
extensive memory requirements of the standard approaches and opening up novel op- 
portunities to take full advantage of contemporary microarray data collections. More- 
over, using the most comprehensive data collections to estimate probe-level effects can 
assist in pinpointing individual probes affected by various biases and provide new tools 
to guide array design and quality control. The implementation is freely available in 
R/Bioconductor at http:/ /www. bioconductor.org/packages/devel/bioc/html/RPA. html. 
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1 Introduction 



Accumulation of research data in in- house and pubhc repositories, such as the Array Express 
(Parkinson et at, 2010) and Gene Expression Omnibus (Barrett et al, 2005) has created 
data collections that include tens of thousands of microarray experiments originating from 
standardized measurement platforms (Parkinson et al., 2010), helping to overcome some of 
the inherent biases in meta-analyses involving multiple measurement platforms (Kilpinen 
et al., 2008; Rudy and Valafar, 2011). By combining data from hundreds of individual stud- 
ies and thousands of commensurable microarray experiments it is possible to obtain a more 
holistic picture of genome function and carry out detailed investigations of targeted model 
systems and diseases that can benefit from large sample sizes of the new data collections 
(Lukk et al., 2010). A major portion of contemporary microarray collections originates 
from short oligonucleotide microarrays (Lockhart et al, 1996), whose applications range 
from gene expression profiling (Parkinson et al, 2010) to alternative splicing and microbial 
community analysis (Brodie et al, 2007; Rajilic-Stojanovic et al, 2009; Nikkila and de Vos, 

2010) . With nearly a million arrays in the ArrayExpress database, being able to combine 
and analyse very large sets of arrays together is a key challenge with a variety of applica- 
tions (Kohane and Valtchinov, 2012; Lukk et al, 2010; Schmid et al., 2012; Zheng-Bradley 
et al, 2010). 

Reliable preprocessing of the data is central for investigations. While multi-array prepro- 
cessing techniques that combine information across multiple arrays to quantify probe effects 
have led to improved preprocessing performance (Bolstad et al, 2003), the applicability of 
the standard multi-array techniques, such as RMA (Irizarry et al, 2003), GC-RMA (Wu 
and Irizarry, 2004), MBEI (Li and Wong, 2001) and PLIER (Affymetrix Inc., 2005), has 
been limited to a few thousand arrays due to the considerable memory requirements associ- 
ated with processing up to a million probe-level features per a single array. This has formed 
a bottleneck for comprehensive meta-analysis of contemporary microarray data collections. 
Scalable variants of the standard preprocessing algorithms have been recently developed 
to tackle these shortcomings. The scalable refRMA (Katz et al, 2006) and fRMA (McCall 
et al., 2010) algorithms rely on pre-calculated probe effect terms that are estimated from 
restricted reference training sets and then extrapolated to preprocess further microarray 
data. The applicability of these methods has, however, been limited to few specific mi- 
croarray platforms with pre-calculated probe eff'ect terms; while the ArrayExpress database 
lists 203 distinct Affymetrix platforms, fRMA is currently available only for a single mouse 
platform and 2 human platforms (McCall and Irizarry, 2011). No platform-independent 
preprocessing methods have been available to date to extract and utilize complete probe- 
level information in large-scale sample collections in a scalable manner. 

We introduce a fully scalable probe-level model for multi-array preprocessing based on 
Bayesian online-learning to overcome the key limitations of the current approaches. We 
have extended the Robust Probabilistic Averaging framework introduced in (Lahti et al., 

2011) to estimate probe affinities and to incorporate prior information of the probes in the 
analysis. This provides the basis for scalable online-learning based on sequential hyperpa- 
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rameter updates, allowing rigorous preprocessing of large microarray atlases on an ordinary 
desktop computer in small consecutive batches with minimal memory requirements and in 
linear time with respect to sample size. The proposed Online-RPA algorithm (RPA) pro- 
vides the means to extract probe-level information across arbitrarily large data collections 
and it is readily applicable to all short oligonucleotide microarray platforms. Moreover, 
the analysis of probe performance can now be based on the most comprehensive collections 
of short oligonucleotide array data, which provides useful information also for microarray 
design and quality control. To our knowledge this is the only probe-level preprocessing 
method which is both fully scalable and platform- independent, providing new tools to take 
full advantage of contemporary microarray data collections. 

2 Results 

2.1 Scalable preprocessing with online-learning: an overview 

The standard steps in microarray preprocessing include background correction, normaliza- 
tion, and probe summarization. Probe-level procedures that combine information across 
multiple arrays have been found to improve preprocessing performance (Bolstad et al, 
2003) but their applicability to large sample collections has been limited due to huge mem- 
ory requirements associated with increasing sample sizes. The available solutions have been 
based on learning and extrapolation of probe-level effects from smaller reference training 
sets (Katz et al, 2006; McCall et al, 2010). We propose an alternative online- learning pro- 
cedure that can extract probe-level information across the complete microarray collection 
with minimal memory requirements and in linear time with respect to sample size based 
on Bayesian hyperparameter updates. Assuming that appropriate microarray quality con- 
trols have been applied prior to the analysis, the standard steps of background correction, 
normalization, and probe summarization are applied to consecutive batches of the data in 
four sweeps over the complete data collection: 

Step 1: Background correction In the first step, each individual array is background- 
corrected with the standard RMA background correction (Irizarry et al., 2003). The 
processed data can be stored temporarily on hard disk to speed up preprocessing in the 
following steps that operate on background-corrected data. 

Step 2: Quantile basis estimation The base distribution for quantile normalization is ob- 
tained as the average over sorted probe-level signals from background-corrected data (Bol- 
stad et al, 2003). We have implemented a scalable version of standard quantile normal- 
ization where the initial base distribution is calculated from the first batch and updated 
at each new batch by taking weighted average between the current base distribution and 
the one from the new batch, weighted by their corresponding sample sizes. The final base 
distribution is identical to the one which would be obtained by jointly normalizing all 
arrays in a single batch. Similar approach is used in parallel implementations of RMA 
where the quantile basis from individual batches are calculated and applied to the com- 
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plete data collection after averaging over the individual basis distributions (Schmidberger 
et ai, 2009). 

Step 3: Hyperparameter estimation The key novelty of our approach is in introducing the 
online-learning model for consecutive probe-level hyperparameter updates. The background- 
corrected batches are normalised with the quantile base distribution, log-transformed, and 
finally used to update the model parameters. At the first batch, the model can be initial- 
ized by giving equal priors for the probes if no probe-specific prior information is available. 
The probe-level hyperparameters are then updated at each new batch, providing priors 
for the next batch. The final probe-level parameters are obtained after scanning over all 
batches in the data collection. Ideally, this is expected to yield an identical result with a 
single batch approach. 

Step 4- Probe summarization In the last step, the final probe-level parameters obtained 
from the third step are applied to summarize the probes in each batch, yielding the final 
preprocessed data matrix. 

2.1.1 The probe-level model 

Let us summarize the probe-level model for a fixed probeset with J probes measured at 
T+1 arrays. The model is based on background-corrected, normalized, and log-transformed 
probe-level data. We assume a Gaussian model for probe effects, where the signal Sij of 
probe j £ {1, . . . , J} in sample i G {1, . . . , T + 1} is modeled as a sum of the underly- 
ing target signal and Gaussian mean and variance parameters /Xj, tJ that are directly 
interpretable as constant affinity iJ,j and stochastic noise £ij ~ iV(0,rJ), respectively: 



In the following, let us describe the estimation procedure for the model parameters a = 
[«!,..., ar+i] , M = , • • • 5 A* j] > '''^ = [''"i > ■ • • > ''"j] > ^-^id the additional modeling assumptions 
that are necessary to circumvent unidentifiability of the probe affinity terms. 

2.1.2 Incorporating prior information of the probes 

We start the analysis by estimating the variance parameters r^, following the procedure in 
(Lahti et al, 2011). In summary, the analysis is based on probe-level differential expression 
signal between each sample t = [1,... ,T] and a randomly selected reference sample r. 
Then, given Eq. 1, the unidentifiable affinity parameters /ij cancel out, yielding stj — Srj — 
{at — Or) + {etj — £rj)- Denoting mt = st — Sr, dt = at — ar, and applying the vector notation 
m = [mi, . . . ,m5"], d = [di,. . . ,dT], the full posterior density for the model parameters 
d, is obtained with the Bayes' rule as 



ai + fij + Eij. 
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P{d,T'^\m) ^ P{m\d,T^)P{d,T^). (2) 

Assuming independent observations nij, given the model parameters, and marginalizing 
over £rj, the likelihood term in Eq. 2 is then given (Lahti et al, 2011) by 



P(m|d,T2) = J] J N{mtj\dt - erj,rf)Nierj\0,Tf)de 

Etimt, - - ^^^^^^ 



tj 

T 



(3) 



JJ(27rrJ) 2 exp{ 



With non-informative priors for P{d, r^) the posterior of Eq. 2 would reduce to maximum- 
likelihood-estimation of Eq. 3 as in (Lahti et al, 2011). The Bayesian online-learning 
version, developed and validated in this paper, takes full advantage of the prior term. This 
forms the basis for sequential updates of the posterior in Eq. 2. Assuming independent 
prior terms, a non-informative prior P{d) ~ 1, and inverse Gamma conjugate priors for 
the variances with probe-specific hyperparameters Oj and Pj (Gelman et al, 2003), the 
prior takes the form 



P{d,T') = P{d)P{T') r.l[r-\rf;a„f^,). (4) 



The posterior in Eq. 2 is now fully specified given the likelihood (Eq. 3), the prior (Eq. 4), 
and the hyperparameters a = [ai, . . . , aj], (3 = [/3i, . . . , /3j]. 

Our primary interest is in estimating the probe-specific variances r^, while d is a nuisance 
parameter that could be marginalized out from the model to obtain more robust esti- 
mates of r^. Since no analytical solution is available and sampling-based marginalization 
approaches would slow down computation, we find a single point estimate for the joint pos- 
terior in Eq. 2 as a fast approximation. Point estimates are found by iterative optimization 
of d and r^. A mode for d, given r^, is searched for by a standard quasi-Newton optimiza- 
tion method (Goldfarb, 1970). Then, given d, the priors {a,P) and sample size T + 1, the 
variance tJ follows inverse Gamma distribution with hyperparameters aj = aj + ^ and 

= /3, + UEtimj - dtf - ^^*^7|^''»' )), yielding 

P(T||m,d) -r-n-rllttiJi). (5) 

The point estimate for rj in this model is readily given by the mode at r? = /3j/(aj + 1). 
We give equal weight for all probes by initializing the process with r? = /3/(q + 1) with 
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a = /3 = 1 for all j if specific prior information is not available. The parameters d and 
are then iteratively updated until convergence (< 0.01 change in parameter values in 
our experiments). The inverse Gamma hyperparameters corresponding to the final can 
then be retrieved as aj = aj + ^ and (3j = Tj{aj + 1). 

2.1.3 Online-learning of variance hyperparameters 

The above formulation allows incorporation of prior information of the probes in the anal- 
ysis and sequential updates where the updated hyperparameters a, fB from the previous 
batch provide priors for the next batch through Eq. 2 and the prior in Eq. 5. In the ab- 
sence of prior information we shall give equal weight for all probes j at the first batch by 
setting aj = 1; f3j = 1 for all j. The hyperparameters a, /3 are then updated with new 
observations at each batch until the complete data collection has been screened through, 
yielding the final probe-level hyperparameters. 

2.1.4 Affinity estimation and probe summarization 

The remaining task after learning the probe-specific variance hyperparameters is to esti- 
mate the probeset-level signal a and probe affinities /i in Eq. 1. Unidentifiability of probe 
affinities is a well-recognized issue in microarray preprocessing (Irizarry et al, 2003), and 
further assumptions are necessary to formulate an identifiable model. A standard approach, 
used in the widely used RMA algorithm (Irizarry et al, 2003) is to assume that the probes 
capture the underlying signal correctly on average and the probe affinities sum to zero: 
TijUj = 0. We propose a more flexible probabilistic approach where this hard constraint 
is replaced by soft priors that keep the expected probe affinities at zero and allow higher 
deviations for the more noisy probes with a higher variance tJ. To implent this we apply a 
Gaussian prior fij --^ A(0,rj) for the affinities in Eq. 1. This allows higher fluctuations for 
the more noisy probes, yielding a better fit between the probeset-level signal estimate a 
and the less noisy probes with smaller tJ. The prior expectation of the sum of probe affini- 
ties, '^jUj, will remain at zero but in contrast to RMA, deviations from this are allowed 
if supported by the data. Alternatively, the affinity priors could be determined based on 
known probe-specific factors, such as GC-content which is a key element in probe affinity 
estimation in the GC-RMA algorithm (Wu and Irizarry, 2004). As probe performance 
is affected by a number of factors, however, we prefer the data-driven approach which 
can accommodate various, potentially unknown probe-level noise sources. Point estimates 
for the probe affinities are identified based on the following procedure. From Eq. 1 we 
have flj = Sij — fij — Sij ~ N{sij,2Tj). A maximum-likelihood estimate for aj is then 
obtained as a weighted sum of Sij over the probes j, weighted by the inverse variances: 
Gi = ^1 E,-:-^(sj,). Given the estimated Oj, the corresponding maximum- likelihood es- 

J 

(i) 

timate for each is then given by jij = Sij — ai. Robust estimates of are obtained 
by taking an average over affinity estimates across multiple samples, yielding the final es- 
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timates = [^ui, . . . , ^uj]. This completes the specification of the model and allows probe 
summarization according to Eq. 1. Given /x, r^, the final estimate for the probeset-level 
signal Oj is now readily obtained by Eq. 1 as the weighted sum of Sij — fij over the probes 
j, weighted by the inverse variances: = T,j^(sij — fJ'j)- 

j 

2.2 Validation 

In the following, let us investigate the scalability and parameter convergence of the new 
online-learning algorithm. The model performance is further validated by comparisons to 
alternative preprocessing methods with standard benchmarking procedures based on the 
AffyComp spike-in data sets (Cope et al., 2004) and a large-scale human gene expression 
atlas (Lukk et al, 2010). 

2.2.1 Scalability and parameter convergence 

Ideally, the online-learning procedure is expected to yield identical results with the stan- 
dard single-batch algorithm. We confirmed this by comparing the results obtained with 
the regular single-batch and online-learning versions of the RPA algorithm for a moder- 
ately sized data set of 300 randomly selected samples, which can be preprocessed with 
both models. The probeset-level signal estimates (a) correlated to a high degree (Pearson 
correlation r > 0.995; p < 10~^) between the single-batch and online-learning versions. 
The selected batch size may affect the running time depeding on the available memory 
resources but the results were robust to varying batch sizes of 10, 50, 100 and 300 samples 
(data not shown); we have used a batch size of 50 samples in the experiments. The high 
correspondence between the single-batch and online-learning models confirms the technical 
validity of the online version. 

Parameter convergence depends on the versatility of the data collection and overall probe- 
level noise, which can vary between probesets (McCall et al, 2010). For certain probesets, 
the probe parameters start to convergence before 2000-3000 samples or later, indicating 
that the sample sizes of ~ 1000 arrays typically applied with fRMA (McCall et al, 2010; 
McCall and Irizarry, 2011) may in some cases be be too low to ensure parameter conver- 
gence (Supplementary Fig. 1). 

The scalability of Online-RPA was investigated by preprocessing up to 20,000 HG-U133A 
CEL files from ArrayExpress. The model scales up linearly with respect to sample size 
(Supplementary Fig. 2), with 10 hours preprocessing time for 20,000 CEL files on a standard 
desktop (Z400 with four 3.06 GHz processor cores). 
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2.2.2 Comparison with alternatives 

The currently available approaches for scalable preprocessing can be classified in (i) tra- 
ditional single-array preprocessing methods and (ii) frozen multi-array techniques. In 
single-array algorithms, the preprocessing is performed separately for each array. Such 
approaches are fully scalable but cannot combine probe-level information across multiple 
arrays, limiting their accuracy compared to multi-array procedures (Bolstad et al, 2003). 
We include the MASS algorithm (Affymetrix Inc., 2005) as the reference as this is one of 
the most well-known single-array preprocessing techniques. The frozen multi-array tech- 
niques include the refRMA (Katz et al, 2006) and fRMA (McCaU et al, 2010) algorithms. 
To our knowledge, fRMA (McCall et al, 2010), which incorporates ideas from refRMA 
(Katz et al, 2006), is the only available multi-array preprocessing algorithm for scalable 
preprocessing of large-scale microarray collections. The fRMA is based on a standardized 
database of pre-calculated probe effect terms, which are then applied to preprocess new 
arrays in the database. The estimation procedure for probe effects is not scalable, however, 
and applicability of fRMA is currently limited to three Affymetrix platforms. In addition 
to the standard "single-array" fRMA model (fRMA), we consider a second variant which 
includes an additional model for batch effects (fRMA-batch) . This incorporates additional 
experiment-specific information to the analysis which cannot be utilized by the other meth- 
ods. While this can improve preprocessing performance, batch information is not neces- 
sarily available for heterogeneous microarray collections, and sets further requirements for 
the application of the fRMA-batch procedure. Finally, we include in the comparisons the 
widely used RMA algorithm (Irizarry et al, 2003). In contrast to the other approaches 
RMA is not fully scalable, but the RMA preprocessed version of the Lukk et al (2010) 
data set is readily available, and as one of the most widely used standard preprocessing 
algorithms this is a relevant reference. 

2.2.3 Spike-in experiments 

The AffyComp website (http://affycomp.biostat.jhsph.edu; Cope et al (2004)) provides 
standard benchmarking tests for microarray preprocessing based on spike-in experiments 
on the Affymetrix HG-U95av2 and HG-U133A_tag platforms. The tests are designed to 
quantify the relative sensitivity and accuracy of the alternative preprocessing algorithms 
based on known target transcript concentrations. We focus here on the scalable MASS, 
fRMA, and RPA algorithms since most other methods at the AffyComp website have 
been designed for moderately sized data sets and have therefore a different scope than the 
scalable approaches. However, fRMA is not applicable to the spike-in data sets as pre- 
calculated probe-effect vectors are not available for these microarray platforms. Therefore 
we have included in the comparisons the widely used RMA algorithm which has a closely 
related probe-level model and is the key preprocessing method in the field. All arrays 
were preprocessed in a single batch with RPA. The Figure 1 summarizes the 14 Affy- 
Comp benchmarking tests for MAS5, RMA, and RPA. The complete comparison results 



9 



for various preprocessing algorithms are available the AfFyComp website . 

RPA outperformed RMA in 13 and 11 tests (out of 14) on the HG-U95Av2 and HG- 
U133A_tag data sets, respectively (Figure 1). In particular, RPA outperformed RMA 
with respect to bias (AffyComp tests 5-10; Supplementary Fig. 3) and true positive-false 
positive rate, quantified by ROC/AUC analysis (AffyComp tests 11-14); the differences 
between RPA and the other methods were particularly salient with low concentration 
targets (Fig. 2). In the other benchmarking tests, RPA and RMA had comparable per- 
formance. Interestingly, MASS had the smallest bias although RPA and RMA in general 
outperformed this method, and in certain tests such as ROC/AUC analysis MAS5 failed 
to distinguish the spike-in transcripts from noise. Certain methods in general outperform 
RPA at the AffyComp benchmarking tests, including the FARMS (Hochreiter et al, 2006) 
and GCRMA (Wu et al, 2003) algorithms. These methods have a more limited scalability 
than RPA, however, and hence a different scope. 

In summary, the fully scalable RPA had a similar or improved preprocessing performance 
in spike-in data sets compared to the standard RMA and MAS5 algorithms, and a wider 
scope than the only scalable probe-level alternative, fRMA, which is not available for the 
spike-in platforms (McCall and Irizarry, 2011). 

2.2.4 Classification performance 

We investigated the classification performance of the comparison methods based on the 
(Lukk et al, 2010) data collection of 5,372 samples with 369 cell and tissue types, disease 
states and cell lines (Fig. 3). A random forest classifier (Breiman, 2001) was trained to 
distinguish between these classes based on 1000 randomly selected probe sets and 500 trees 
at 10 cross-validation folds, where the data was split into training (90%) and test (10%) sets. 
The singleton classes (150 samples) were excluded. The comparisons were performed with 
both the standard Affymetrix probe sets and alternative probesets based on the Ensembl 
Gene (ENSG) identifiers. RPA outperformed RMA and MAS5 (p < 0.05; paired Wilcoxon 
test). Differences between RPA and fRMA were not significant, and the fRMA with batch 
effect model (fRMA-batch) outperformed the other methods (p < 0.05). It is important to 
note, however, that fRMA and fRMA-batch algorithms have a considerably more limited 
scope than Online-RPA, as detailed in Discussion. Further details on the data, class and 
batch definitions are provided in Materials and Methods. 

2.2.5 Correlation between technical gene replicates 

The standard Affymetrix arrays contain multiple probesets for certain transcripts. As 
the final benchmarking test, we compared the probeset-level summaries for such technical 
gene replicates on the (Lukk et al, 2010) data set, following Elo et al. (2005); Lahti et al. 

http://affycomp.biostat.jhsph.edu/AFFY3/comp _form.html 
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(2011). Pearson correlation was calculated for each Affymetrix probeset pair sharing the 
same EnsembllD (Bioconductor package hgulSSa.db). The average correlations over all 
pairs were as follows: MAS5 0.46, RMA 0.53, fRMA 0.51, fRMA.batch 0.55 and RPA 0.54. 
The differences between the methods were significant (paired Wilcoxon test p < 0.01). In 
this comparison, RPA outperformed MAS5, RMA and fRMA (Supplementary Fig. 4). 



3 Discussion 



High memory requirements of the standard preprocessing techniques for short oligonu- 
cleotide arrays have limited their applicability to moderately sized data sets. Scalable 
probe-level preprocessing techniques, the refRMA (Katz et al, 2006) and fRMA (McCall 
et al., 2010), have been available only for few microarray platforms based on precalculated 
probe effects from restricted reference training sets. The lack of scalable, general-purpose 
preprocessing algorithms is hence forming a bottleneck for large-scale meta-analyses and 
full utilization of contemporary microarray collections. Our proposed method provides 
novel tools for taking advantage of the full information content in these data collections. 
With nearly a million arrays in the ArrayExpress database, being able to combine and 
analyse very large sets of arrays together is a key challenge with a variety of applications 
ranging from gene expression profiling (Lukk et al, 2010; Kohane and Valtchinov, 2012; 
Parkinson et al, 2010; Schmid et al, 2012; Zheng-Bradley et al, 2010) to alternative splic- 
ing and microbial community analysis (Brodie et al, 2007; Rajilic-Stojanovic et al, 2009; 
Nikkila and de Vos, 2010). 

We have introduced the first fully scalable online-learning algorithm that can extract and 
utilize individual probe effects across arbitrarily large microarray collections, overcoming 
the key scalability limitations of the current preprocessing techniques. The algorithm learns 
probe-level parameters by sequential hyperparameter updates, processing the data in small 
consecutive batches. This makes the model readily applicable to contemporary microarray 
collections with tens of thousands of samples by extending the probabilistic framework 
introduced in Lahti et al. (2011). In contrast to the alternatives our model is fully scal- 
able and readily applicable to all microarray platforms as its application does not depend 
on precalculated probe effect terms. The model scales up in linear time with respect to 
sample size, with 10 hours running time for 20,000 GEL files in our experiments. The 
running time could be further accelerated by optimizing the implementation, using more 
efficient processors and parallelizing with multiple cores (Schmidberger et al, 2009). Inter- 
estingly, we noticed that averaging of the probes, weighted by the probe-specific variance 
estimates provided by RPA, yielded similar results with the full model that additionally 
includes probe affinities. The differences between the two models were not significant, in- 
dicating that affinity analysis could be omitted to further speed up computation without 
compromising preprocessing performance. In analogy to fRMA, our model also allows the 
utilization of estimated model parameters as priors to preprocess further data sets. This 
can provide the advantages of single-chip methods and fRMA of not having to recompute 
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the whole preprocessing procedure when new arrays are included in the data collection. 
Providing frozen parameter estimates can not only speed up computations but also allow 
reproducible analysis of single arrays for diagnostic and other purposes, as suggested in 
(McCall et al, 2010). 

Probe performance is affected by RNA degradation, non-specific hybridization, GC- and 
SNP-content, annotation errors and other, potentially unknown factors. While modeling of 
the probe effects have been shown to yield improved estimates of the target signal (Irizarry 
et al, 2003; Li and Wong, 2001), the various sources of the probe-level noise remain poorly 
understood. While our model can be applied to learn the probe-level parameters based 
on subsets of the data, with the fully scalable extension the analysis of probe performance 
can now be based on the most comprehensive collections of short oligonucleotide array 
data that are available in the public repositories. As such, RPA can assist in nailing down 
individual probes affected by various biases, giving tools to guide microarray preprocessing 
and probe design in future studies and industry standards (Katz et al, 2006). 

The widely used RMA algorithm is a special case of our single-batch model, assuming 
that the affinities sum to zero and all probes within a probeset have identical variances. 
The recent scalable extension, fRMA (McCall et al, 2010), has a more detailed model 
for probe-specific variances. While RPA was comparable to or outperformed the standard 
fRMA algorithm in our experiments, the fRMA-batch which utilizes additional sample 
metadata outperformed RPA. It is important to note, however, that the modeling of batch 
effects in fRMA is only possible when sufficient sample metadata is available which is not 
always the case with large and heterogeneous microarray collections, which are the primary 
target of scalable preprocessing algorithms. Moreover, batch effects could be modeled in 
a separate step for instance based on linear models (Chen et al, 2011; Leek et al, 2012). 
Comparison of the various batch effect modeling techniques is out of the scope of this paper, 
however. Finally, the applicability of the fRMA and fRMA-batch algorithms is currently 
limited to only three Affymetrix platforms: HG-U133Plus2.0, HG-U133A, and MG-430 2.0 
(McCall and Irizarry, 2011), and could therefore not be included in the standard AffyComp 
spike-in benchmarking comparisons which rely on other microarray platforms. Therefore 
fRMA and fRMA-batch have a more limited scope than RPA. While certain methods, such 
as FARMS (Hochreiter et al., 2006) and GCRMA (Wu et al, 2003) with more detailed 
probe-level models outperform RPA in spike-in experiments, their scalability and hence 
the scope are more limited. 

The introduced online-learning approach therefore remains currently the only fully scalable 
preprocessing algorithm that is readily applicable on all short oligonucleotide platforms, 
outperforms the standard RMA algorithm, and can be used for scalable probe-level analysis 
and preprocessing of gene expression atlases involving tens of thousands of samples. This 
provides new tools to scale up investigations and to take advantage of the full information 
content of the rapidly expanding microarray data collections. 
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4 Methods and Data Access 



4.1 Data 

We have selected for comparisons a reasonably large, well-annotated and quality assessed 
microarray data set including 5372 human samples from a versatile collection of 369 cell and 
tissue types, disease states and cell lines from 206 public experiments and 162 laboratories, 
measured with the Affymetrix HG-U133A microarray (Lukk et al, 2010). The biological 
groups are of varying sizes, and include 150 classes with only one sample (singleton classes); 
the annotations describing the group of each sample in the data set can be retrieved from 
the ArrayExpress archive (accession number: E-MTAB-62)^. This data set is ideal for 
benchmarking of scalable preprocessing methods since the alternative fRMA preprocessing 
model depends on the availability of precalculated probe effect terms, which are available 
for this microarray. Moreover, sufficient sample metadata is available for this data set to 
include batch effects in the fRMA model. In addition, despite of the heterogeneous origin 
of the data set, which made unfeasible to obtain 'batches' in strictly the same manner as 
defined in (McCall et al, 2010), we could approximate them with the following approach. 
For each array within each experiment in the data set we retrieved the creation date of the 
GEL file from its HEADER section, under the DatHeader TAG, and assigned to the same 
batch those arrays from the same experiment (and laboratory) that were scanned on the 
same day. Thus, it was possible to assess the two available versions of fRMA, specifically the 
"single-array" and "batch-of-arrays" . Moreover, the sample size allows comparisons with 
the standard RMA preprocessing method (Irizarry et al, 2003). Finally, in addition to 
the standard Affymetrix probe sets we have included in the comparisons alternative probe 
sets based on updated Ensembl gene mappings available through the hgul33ahsensgcdf 
(14.1.0) annotation package (Dai et al, 2005). The reference probe effects for fRMA and 
the alternative mapping of probes to genes were built using the Bioconductor frmaTools 
(McGah and Irizarry, 2011). 

4.2 AffyComp spike-in experiments 

The score components in Figure 1 are as follows: (1) median SD across replicates; (2) 
Inter-quartile range of the log-fold-changes from genes that should not change; (3) 99.9% 
percentile of the log-fold-changes if from the genes that should not change; (4) obtained 
from regressing expression values on nominal concentrations in the spike-in data; (5) slope 
obtained from regressing expression values on nominal concentrations in the spike-in data; 
(6) slope from regression of observed log concentration versus nominal log concentration 
for genes with low intensities; (7) same for genes with medium intensities; (8) same for 
genes with high intensities; (9) slope obtained from regressing observed log-fold-changes 
against nominal log- fold-changes; (10) slope obtained from regressing observed log-fold- 
changes against nominal log-fold-changes for genes with nominal concentrations less than 

^http://www.ebi.ac.uk/gxa/experimentDesign/E-MTAB-62 



13 



or equal to 2; (11) area under the ROC curve (AUG; up to 100 false positives) for genes 
with low intensity standardized so that the optimum is 1; (12) AUG for genes with medium 
intensities; (13) AUG for genes with high intensities; (14) a weighted average of the ROG 
curves 11-13 with weights related to amount of data in each class. For full details, see Gope 
et al. (2004). 
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7 Figure Legends 



Figure 1: The benchmarking statistics for the AffyCompIII spike- in data for RPA, RMA, 
and MASS for the HG-U95Av2 (top) and HG-U133A_tag (bottom) platforms. RPA and 
MASS represent fuhy scalable algorithms, and the standard RMA algorithm has been 
included as a benchmark since its fully scalable extension, fRMA, is not available for the 
spike-in platforms. For clarity of presentation, we have transformed the scores 1-3 with 
1 — x; so that the score value of 1 corresponds here to ideal performance at all 14 scores. 
For a full description of the 14 benchmarking components, see Materials and Methods. 

Figure 2: Average ROC curves for low-abundance targets with nominal concentrations at 
most 4 picoMolar and nominal fold changes at most 2 in the AffyCompIII spike-in data for 
MASS (solid hue), RMA (dashed line), and RPA (dotted hue) on the HG-U9SAv2 (left) 
and HG-U133A_tag (right) platforms. The Figure has been adapted from AffyCompIII 
test SC. For details, see Cope et al. (2004). 

Figure 3: Classification performance in Lukk et al. data set (Lukk et al, 2010) for the com- 
parison algorithms. The S372 samples were classified into 369 cell and tissue types, and af- 
ter excluding the singleton classes the classification performance was quantified by random 
forest classifier based on 10-fold cross-validation as described in the Results. Online-RPA 
outperforms RMA and MASS {p < O.OS). Differences between RPA-online and fRMA are 
not significant, and fRMA-batch outperforms the other methods {p < O.OS). 
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